Minimal Data Copy for Dense Linear Algebra Factorization
Abstract
We describe a new result showing that representing a matrix A as a collection of square blocks reduces the amount of data reformatting required by dense linear algebra factorization algorithms from O(n³) to O(n²).

1 Description of a Fortran and C Inefficiency for Dense Linear Algebra Factorizations

The current Most Commonly Used (MCU) Dense Linear Algebra (DLA) algorithms for serial and SMP processors have a performance inefficiency, and hence they give sub-optimal performance. We show that standard Fortran and C two-dimensional arrays are the main reason for this inefficiency, and we show how to correct it by using New Data Structures (NDS) along with so-called kernel routines. The NDS generalize the current storage layouts of both the Fortran and C programming languages.

The BLAS (Basic Linear Algebra Subroutines) were introduced to make the algorithms of DLA performance-portable. However, a relationship exists between the Level 3 BLAS and the Level 3 factorization routines that call them, and this relationship introduces a performance inefficiency into block-based factorization algorithms. We use the Level 3 BLAS routine DGEMM (Double precision GEneral Matrix-Matrix multiply) to illustrate this fact. In [5, 2], design principles for producing a high-performance Level 3 DGEMM BLAS are given. A key design principle for DGEMM is to partition its matrix operands into submatrices and then call an L1 kernel routine multiple times on the submatrix operands. Another key design principle is to change the data format of the submatrix operands so that each call to the L1 kernel can operate at or near the peak Million FLoating point OPerations per Second (MFlops) rate. This format change, and the subsequent change back to standard data format, is a cause of performance inefficiency in DGEMM: the DGEMM interface definition requires that its matrix operands be stored as standard Fortran or C two-dimensional arrays. Any DLA factorization algorithm (DLAFA) applied to a matrix A calls DGEMM multiple times, with all of its operands being submatrices of A. Data copy is performed for each call, so the principal inefficiency is multiplied by the number of calls. However, this inefficiency can be eliminated by using the NDS to create a ...
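To make the square-block idea concrete, the following C sketch performs the one-time repacking of a column-major matrix into Square Block (SB) format. The routine name pack_to_sb, the column-major ordering of the blocks, and the simplifying assumption that n is a multiple of the block order nb are ours for illustration; they are not taken from the paper.

#include <stdio.h>
#include <string.h>

/* Illustrative sketch: repack a column-major n x n matrix A (leading
 * dimension lda) into Square Block (SB) format with block order nb.
 * Each nb x nb block is stored contiguously, itself in column-major
 * order, so an L1 kernel can run on it with no further data copy.
 * For simplicity, n is assumed to be a multiple of nb. */
static void pack_to_sb(int n, int nb, const double *A, int lda, double *B)
{
    int nblk = n / nb;                        /* blocks per dimension */
    for (int jb = 0; jb < nblk; ++jb)         /* block column index   */
        for (int ib = 0; ib < nblk; ++ib) {   /* block row index      */
            /* contiguous destination for block (ib, jb); blocks are
             * laid out in column-major block order */
            double *blk = B + ((size_t)jb * nblk + ib) * nb * nb;
            for (int j = 0; j < nb; ++j)      /* copy one block column */
                memcpy(blk + (size_t)j * nb,
                       A + ((size_t)jb * nb + j) * lda + (size_t)ib * nb,
                       (size_t)nb * sizeof(double));
        }
}

int main(void)
{
    enum { N = 4, NB = 2 };
    double A[N * N], B[N * N];
    for (int i = 0; i < N * N; ++i)
        A[i] = (double)i;                     /* column-major fill */
    pack_to_sb(N, NB, A, N, B);
    for (int b = 0; b < (N / NB) * (N / NB); ++b) {  /* print each block */
        printf("block %d:", b);
        for (int k = 0; k < NB * NB; ++k)
            printf(" %2.0f", B[b * NB * NB + k]);
        printf("\n");
    }
    return 0;
}

Once A is held in SB format, every block update in the factorization can hand an L1 kernel a contiguous nb-by-nb tile directly, so no per-call format change is needed. By contrast, when DGEMM must repack internally on every call, a blocked factorization with n/nb steps copies on the order of Σ_k (n − k·nb)² ≈ n³/(3·nb) elements in total, i.e. O(n³), whereas the one-time packing above copies exactly n² elements.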
Similar Resources
Scheduling dense linear algebra operations on multicore processors
State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workl...
Automatically Tuned Dense Linear Algebra for Multicore+GPU
The Multicore+GPU architecture has been adopted in some of the fastest supercomputers listed on the TOP500. The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures such as Multicore+GPU. However, to provide portable performance, manual parameter tuning is required. This paper presents automatically tuned LU factorizat...
New Data Structures for Matrices and Specialized Inner Kernels: Low Overhead for High Performance
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of non-canonical data structures for dense linear algebra can be better exploited with the u...
Dense and Iterative Concurrent Linear Algebra in the Multicomputer Toolbox
The Multicomputer Toolbox includes sparse, dense, and iterative scalable linear algebra libraries. Dense direct and iterative linear algebra libraries are covered in this paper, as well as the distributed data structures used to implement these algorithms; concurrent BLAS are covered elsewhere. We discuss uniform calling interfaces and functionality for linear algebra libraries. We include a d...
Minimizing Communication in Numerical Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication (the amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense n-by-n matrix multiplication using the conventional O(n³) algorithm, where the input matrices are too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it...
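For reference, the Hong-Kung bound mentioned above is usually stated as follows (a standard formulation with constant factors omitted; M denotes the size of the fast memory in words, and W the number of words moved between fast and slow memory):

W = \Omega\!\left( \frac{n^{3}}{\sqrt{M}} \right)

That is, any execution order of the conventional algorithm must move on the order of n³/√M words, which is exactly the kind of data-movement cost that blocked layouts and factorizations are designed to approach.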
Publication date: 2006